fix(results): handle replay row schema aliases by christso · Pull Request #1400 · EntityProcess/agentv

christso · 2026-06-17T09:31:37Z

Summary

Historical replay-shaped result rows now hydrate consistently across results and inspect: failed rows keep their testId, timing/token fields are preserved, and inspect filter --has-tool sees trace.toolCalls aliases.

The accepted results command input remains a run workspace or index.jsonl manifest. Canonical rows are still snake_case; this PR normalizes only the known historical camelCase aliases at the file boundary, then rejects eval-case-only JSONL with migration guidance instead of treating it as an unknown failed result.

Validation

bun test apps/cli/test/commands/results
bun test apps/cli/test/commands/eval/artifact-writer.test.ts
bun test packages/core/test/evaluation/trace-summary.test.ts
bun run typecheck
bun run lint
Push hook reran bun run typecheck and bun run lint successfully.

Red/Green UAT

Red, before the fix on the same replay-shaped row:

results summary returned failed_test_ids: ["unknown"], total_duration_ms: 0, and total_tokens: 0.
inspect filter --has-tool rg returned [].

Green, after the fix:

bun apps/cli/src/cli.ts results summary apps/cli/test/fixtures/results/camel-replay returns failed_test_ids: ["wtg-replay-fail"], total_duration_ms: 1234, and total_tokens: 15.
bun apps/cli/src/cli.ts inspect filter apps/cli/test/fixtures/results/camel-replay/index.jsonl --has-tool rg --format json returns the wtg-replay-fail row with tool_names: ["rg"].
Eval-case-only JSONL is rejected with: Unsupported result row ... Eval-case JSONL is input data, not a results artifact. Run agentv eval <eval-file> --output <run-dir>....

Post-Deploy Monitoring & Validation

No additional production monitoring required; this changes local CLI parsing of result artifacts only. After release, validate by running agentv results summary and agentv inspect filter --has-tool against a historical replay manifest if one is available. Watch CI and user reports for Unsupported result row, failed_test_ids containing unknown, or tool filters failing to match known trace summaries.

christso · 2026-06-17T09:59:26Z

Review Findings

P1 - Adjacent trace tests fail after the invalid-score error changed

apps/cli/src/commands/inspect/utils.ts:137 now routes standalone JSONL loading through normalizeResultRow(), so rows with a missing or non-numeric score throw Unsupported result row ... Expected an AgentV result row... instead of the prior Missing or invalid score... message. The existing trace utility contract at apps/cli/test/commands/trace/trace.test.ts:191 and apps/cli/test/commands/trace/trace.test.ts:198 still asserts the old message, and bun test apps/cli/test/commands/trace/trace.test.ts now fails those two tests. CI runs the full test suite, so this should block merge until the trace tests are updated to the new migration guidance or loadJsonlRecords() preserves the trace-specific invalid-score error.

Verification

bun install and bun run build were needed first because this fresh worktree lacked workspace package dist output.
bun test apps/cli/test/commands/results passed: 193 pass.
bun test apps/cli/test/commands/eval/artifact-writer.test.ts passed: 45 pass.
bun run typecheck passed.
bun apps/cli/src/cli.ts results summary apps/cli/test/fixtures/results/camel-replay returned failed_test_ids: ["wtg-replay-fail"], total_duration_ms: 1234, and total_tokens: 15.
bun apps/cli/src/cli.ts inspect filter apps/cli/test/fixtures/results/camel-replay/index.jsonl --has-tool rg --format json returned the replay row with tool_names: ["rg"].
Adjacent check bun test apps/cli/test/commands/trace/trace.test.ts failed: 47 pass, 2 fail, both invalid-score message expectations.

Note: gh pr review --request-changes was rejected by GitHub because the authenticated user cannot request changes on their own pull request, so I am leaving this as a blocking review comment.

cloudflare-workers-and-pages · 2026-06-17T10:47:17Z

Deploying agentv with Cloudflare Pages

Latest commit:	`239ab03`
Status:	✅ Deploy successful!
Preview URL:	https://1c0dafa8.agentv.pages.dev
Branch Preview URL:	https://av-results-row-schema-cleanu.agentv.pages.dev

View logs

christso · 2026-06-17T10:47:44Z

Addressed the blocking trace-test comment in c234fd04.

Decision: preserved the trace-specific invalid-score contract for standalone trace/result JSONL loaded through loadResultFile(). That path accepts trace exports, so Missing or invalid score... is more precise there than the result-artifact migration guidance. Eval-case-only rows still go through the shared result-row schema error and keep the migration guidance.

Validation:

Reproduced before fix: bun test apps/cli/test/commands/trace/trace.test.ts failed 2 invalid-score expectations.
bun test apps/cli/test/commands/trace/trace.test.ts — 49 pass.
bun test apps/cli/test/commands/results — 193 pass.
bun test apps/cli/test/commands/eval/artifact-writer.test.ts — 45 pass.
bun run typecheck — pass.
bun run lint — pass.

christso · 2026-06-17T10:56:04Z

CI Test job follow-up after c234fd0 / failed job https://github.com/EntityProcess/agentv/actions/runs/27683502887/job/81876531953

Root cause from the log:

apps/cli/test/commands/compare/compare.test.ts still expected the result/trace-specific Missing or invalid score contract, but invalid result-shaped rows were being rejected by the shared migration-guidance schema error before compare's own validation ran.
apps/cli/test/commands/results/shared.test.ts built the camel replay fixture path from process.cwd(), which becomes apps/cli under the CLI package test job and produced apps/cli/apps/cli/....

Fix in 239ab033:

normalizeResultRow now keeps eval-case-only rows on the migration-guidance error, while result-shaped rows with missing/non-finite score keep the existing invalid-score schema error.
Nested trace/output normalization no longer writes undefined fields onto rows, so eval-case rows do not get misclassified as result-like.
The camel replay fixture path is resolved relative to shared.test.ts, not cwd.

Validation:

bun test apps/cli/test/commands/compare/compare.test.ts apps/cli/test/commands/results/shared.test.ts — 55 pass
bun test apps/cli/test/commands/results — 193 pass
bun test apps/cli/test/commands/eval/artifact-writer.test.ts — 45 pass
bun test apps/cli/test/commands/trace/trace.test.ts — 49 pass
bun test apps/cli/test/commands/compare/compare.test.ts — 50 pass
bun run typecheck — pass
bun run lint — pass
git diff --check — pass

christso · 2026-06-17T11:13:53Z

Re-review result

No blockers remain for the prior results-schema feedback.

Verified the requested points:

Trace invalid-score expectations are fixed: bun test apps/cli/test/commands/trace/trace.test.ts passed, 49 tests.
Historical camelCase replay rows still normalize: results/shared.test.ts and artifact-writer.test.ts cover the fixture path, timing/token/cost/status aliases, and trace.toolCalls; an extra smoke confirmed canonical test_id/duration_ms values win when both snake_case and camelCase are present.
Compare/results fixture lookup no longer depends on process.cwd(): root-run tests passed, and bun test test/commands/results/shared.test.ts test/commands/compare/compare.test.ts also passed from apps/cli cwd, 55 tests.
Canonical snake_case result rows remain preferred, while eval-case-only JSONL still rejects with the migration guidance (Eval-case JSONL is input data...).

Local verification:

bun install and bun run build for this fresh worktree setup.
bun test apps/cli/test/commands/trace/trace.test.ts — 49 pass.
bun test apps/cli/test/commands/results — 193 pass.
bun test apps/cli/test/commands/compare/compare.test.ts — 50 pass.
bun test apps/cli/test/commands/eval/artifact-writer.test.ts — 45 pass.
bun test test/commands/results/shared.test.ts test/commands/compare/compare.test.ts from apps/cli — 55 pass.
bun run typecheck — pass.
git diff --check origin/main...HEAD — pass.

GitHub checks on the latest commit are green: Build, Test, Typecheck, Lint, Validate Evals, Validate Marketplace, Check Links, and Cloudflare Pages all pass.

Residual risk: this was a targeted re-review of the prior blockers rather than a full functional UAT pass. Given the local targeted coverage plus green CI, I do not see remaining blocking risk for PR #1400.

fix(results): handle replay row schema aliases

959e935

fix(inspect): preserve trace invalid score errors

c234fd0

fix(results): preserve result-score validation errors

239ab03

christso marked this pull request as ready for review June 17, 2026 11:25

christso merged commit 51bb245 into main Jun 17, 2026
8 checks passed

christso deleted the av-results-row-schema-cleanup branch June 17, 2026 11:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(results): handle replay row schema aliases#1400

fix(results): handle replay row schema aliases#1400
christso merged 3 commits into
mainfrom
av-results-row-schema-cleanup

christso commented Jun 17, 2026

Uh oh!

christso commented Jun 17, 2026

Uh oh!

cloudflare-workers-and-pages Bot commented Jun 17, 2026 •

edited

Loading

Uh oh!

christso commented Jun 17, 2026

Uh oh!

christso commented Jun 17, 2026

Uh oh!

christso commented Jun 17, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

christso commented Jun 17, 2026

Summary

Validation

Red/Green UAT

Post-Deploy Monitoring & Validation

Uh oh!

christso commented Jun 17, 2026

Review Findings

P1 - Adjacent trace tests fail after the invalid-score error changed

Verification

Uh oh!

cloudflare-workers-and-pages Bot commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Deploying agentv with Cloudflare Pages

Uh oh!

christso commented Jun 17, 2026

Uh oh!

christso commented Jun 17, 2026

Uh oh!

christso commented Jun 17, 2026

Re-review result

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

cloudflare-workers-and-pages Bot commented Jun 17, 2026 •

edited

Loading